Welcome to The Great Handover. In CPU programming, we define how to iterate; in GPGPU, we define what an iteration looks like. This shift from instruction-centric to data-centric logic is powered by the Kernel Abstraction.
1. The __global__ Blueprint
By using the __global__ qualifier, you are not writing an ordinary function; you are designing a scalable blueprint. Each thread executes its own instance of the kernel as one standalone unit of work, allowing the GPU to orchestrate thousands of identical tasks across its massive core count without manual thread management.
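A minimal sketch of such a blueprint (the kernel name `hello` is illustrative, not from the original text). The __global__ qualifier marks code that is launched from the host but executed on the device by every thread in the grid:

```cuda
#include <cstdio>

// __global__ marks a kernel: launched once from the host,
// but executed by every thread the launch configuration spawns.
__global__ void hello() {
    // Each thread runs this same body with its own coordinates.
    printf("hello from thread %d of block %d\n", threadIdx.x, blockIdx.x);
}
```

Launching this with many threads prints one line per thread, with no loop anywhere in the code: the "iteration" is the grid itself.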
2. The Global Address Resolver
How does a single thread among millions find its target? It uses a deterministic contract known as the indexing formula:
$$\text{threadID} = \text{blockIdx.x} \times \text{blockDim.x} + \text{threadIdx.x}$$
This formula acts as a coordinate system, bridging the software's logical data (the array) to the hardware's physical hierarchy (blocks and threads).
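In kernel code, the formula typically appears as the first line of the body. A sketch assuming a hypothetical element-wise kernel `scale` over an array of `n` floats:

```cuda
// Each thread derives its unique global index from its block and
// thread coordinates, then touches exactly one array element.
__global__ void scale(float *data, float factor, int n) {
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadID < n)          // guard: the grid may be larger than the array
        data[threadID] *= factor;
}
```

The bounds check matters because the grid size is rounded up to a whole number of blocks, so some threads at the tail may map past the end of the data.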
3. Execution Configuration
The <<<B, T>>> launch parameters define the grid shape: B blocks of T threads each. Because the runtime schedules blocks onto whatever streaming multiprocessors are available, this ensures Transparent Scalability: your code runs identical logic whether the hardware has 2 SMs or 80 SMs.
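A host-side sketch of choosing the configuration, assuming a hypothetical element-wise kernel `scale(float*, float, int)` and an already-allocated device pointer `d_data` (both illustrative names):

```cuda
int n = 1 << 20;               // example problem size (assumption)
int T = 256;                   // threads per block, a common choice
int B = (n + T - 1) / T;       // ceiling division: enough blocks to cover n
scale<<<B, T>>>(d_data, 2.0f, n);  // launch B blocks of T threads
```

The ceiling division guarantees B * T >= n, which is exactly why the kernel's bounds check against n is needed.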